Add sigmoid_fp64 kernels#10272

Merged
copybara-service[bot] merged 1 commit into master from test_917603787 on May 19, 2026
@copybara-service
Contributor

Add sigmoid_fp64 kernels

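This change also rewrites the `sigmoid_fp32` kernel using the same technique.

The rewritten `sigmoid_fp32` kernel turns out to be faster, which is a little surprising: about 23-25% faster on AVX2 and 17-18% faster on SSE2, with no statistically significant change on AVX-512. It has less "overhead" (special cases, piecewise branches, etc.) in exchange for more polynomial arithmetic.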
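The description does not spell out the technique, so the following is only a rough scalar sketch of the general trade-off: a single code path (clamp, range-reduce, evaluate one polynomial, reconstruct) instead of piecewise special cases. Everything in it is an assumption made for illustration, including the names `exp_poly` and `sigmoid_sketch`, the Taylor coefficients, and the clamp value; none of it is taken from the actual kernels.

```python
import math

LOG2E = 1.4426950408889634   # 1 / ln(2)
LN2 = 0.6931471805599453     # ln(2)

# Taylor coefficients of exp(r), highest degree first.
# Illustrative only; a real kernel would likely use tuned minimax coefficients.
EXP_COEFFS = [1.0 / math.factorial(k) for k in range(12, -1, -1)]

def exp_poly(z):
    """Approximate exp(z) for z in [-708, 0] with one polynomial."""
    n = round(z * LOG2E)      # split z = n*ln(2) + r with |r| <= ln(2)/2
    r = z - n * LN2
    p = 0.0
    for c in EXP_COEFFS:      # Horner's rule
        p = p * r + c
    return math.ldexp(p, n)   # p * 2**n

def sigmoid_sketch(x):
    """Hypothetical single-path sigmoid: the same arithmetic for every input."""
    z = -min(abs(x), 708.0)   # clamp so 2**n stays in the normal range
    e = exp_poly(z)           # exp(-|x|), always in (0, 1]
    s = e / (1.0 + e)         # sigmoid(-|x|)
    return s if x < 0 else 1.0 - s   # a SIMD kernel would use a select/blend here

for x in (-50.0, -1.0, 0.0, 0.3, 2.0, 40.0):
    print(x, sigmoid_sketch(x), 1.0 / (1.0 + math.exp(-x)))
```

Working on `-|x|` keeps the exponential in (0, 1], so overflow needs no separate case; the cost is the longer polynomial, which matches the "more polynomial arithmetic" side of the trade-off.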

Change in performance for `sigmoid_fp32` (first `time/op` column is the base, second is this change):

```
name                                                                time/op        time/op     vs base
bench/sigmoid_fp32_1x32_x86_avx512f_avx512bw/m:1/n:4096/real_time   1.759µ ± 3%   1.771µ ± 21%        ~ (p=0.818 n=6)
bench/sigmoid_fp32_1x32_x86_avx512f_avx512bw/m:4/n:1024/real_time   1.740µ ± 3%   1.774µ ±  3%        ~ (p=0.065 n=6)
bench/sigmoid_fp32_1x32_x86_avx512f_avx512bw/m:16/n:256/real_time   1.773µ ± 2%   1.783µ ±  1%        ~ (p=0.699 n=6)
bench/sigmoid_fp32_1x16_x86_avx2/m:1/n:4096/real_time               4.909µ ± 1%   3.670µ ±  2%  -25.24% (p=0.002 n=6)
bench/sigmoid_fp32_1x16_x86_avx2/m:4/n:1024/real_time               4.830µ ± 2%   3.706µ ±  4%  -23.28% (p=0.002 n=6)
bench/sigmoid_fp32_1x16_x86_avx2/m:16/n:256/real_time               4.912µ ± 1%   3.740µ ±  2%  -23.87% (p=0.002 n=6)
bench/sigmoid_fp32_1x32_x86_sse2/m:1/n:4096/real_time               6.632µ ± 3%   5.437µ ±  2%  -18.02% (p=0.002 n=6)
bench/sigmoid_fp32_1x32_x86_sse2/m:4/n:1024/real_time               6.637µ ± 3%   5.524µ ±  3%  -16.77% (p=0.002 n=6)
bench/sigmoid_fp32_1x32_x86_sse2/m:16/n:256/real_time               6.692µ ± 4%   5.493µ ±  1%  -17.92% (p=0.002 n=6)
geomean                                                             3.851µ        3.305µ        -14.19%
```
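As a sanity check on the `vs base` column and the `geomean` row, the arithmetic can be reproduced from the printed means. This is a minimal sketch that assumes `vs base` is the relative change in time and that `geomean` is the geometric mean of each column; the helper names are hypothetical. The inputs are the rounded values from the table, so the results should match the printed figures only to within rounding.

```python
import math

# Mean times in microseconds copied from the table above: (base, this change).
rows = [
    (1.759, 1.771), (1.740, 1.774), (1.773, 1.783),   # avx512f_avx512bw
    (4.909, 3.670), (4.830, 3.706), (4.912, 3.740),   # avx2
    (6.632, 5.437), (6.637, 5.524), (6.692, 5.493),   # sse2
]

def percent_delta(base, new):
    """Relative change in time; negative means the new kernel is faster."""
    return (new - base) / base * 100.0

def geomean(values):
    """Geometric mean, computed through logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

base_gm = geomean([b for b, _ in rows])
new_gm = geomean([n for _, n in rows])

print(f"avx2 m:1/n:4096: {percent_delta(4.909, 3.670):+.2f}%")
print(f"geomean: {base_gm:.3f} -> {new_gm:.3f} "
      f"({percent_delta(base_gm, new_gm):+.2f}%)")
```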

`sigmoid_fp64` compared to the reference implementations and the `sigmoid_fp32` kernels:

```
----------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                  Time             CPU   Iterations UserCounters...
----------------------------------------------------------------------------------------------------------------------------
bench_reference/sigmoid_float/m:1/n:4096/real_time                     38579 ns        38571 ns         7348 Bytes=849.382M/s Op=106.173M/s
bench_reference/sigmoid_float/m:4/n:1024/real_time                     38345 ns        38338 ns         7440 Bytes=854.556M/s Op=106.819M/s
bench_reference/sigmoid_float/m:16/n:256/real_time                     39192 ns        39187 ns         7190 Bytes=836.095M/s Op=104.512M/s
bench_reference/sigmoid_double/m:1/n:4096/real_time                    91326 ns        91313 ns         3045 Bytes=717.606M/s Op=44.8504M/s
bench_reference/sigmoid_double/m:4/n:1024/real_time                    91307 ns        91290 ns         3043 Bytes=717.757M/s Op=44.8598M/s
bench_reference/sigmoid_double/m:16/n:256/real_time                    93505 ns        93486 ns         3018 Bytes=700.885M/s Op=43.8053M/s
bench/sigmoid_fp32_1x32_x86_avx512f_avx512bw/m:1/n:4096/real_time       1786 ns         1786 ns       155658 Bytes=18.3422G/s Op=2.29277G/s
bench/sigmoid_fp32_1x32_x86_avx512f_avx512bw/m:4/n:1024/real_time       1802 ns         1802 ns       157599 Bytes=18.1847G/s Op=2.27309G/s
bench/sigmoid_fp32_1x32_x86_avx512f_avx512bw/m:16/n:256/real_time       1791 ns         1791 ns       156134 Bytes=18.2963G/s Op=2.28704G/s
bench/sigmoid_fp64_1x16_x86_avx512f_avx512bw/m:1/n:4096/real_time       4475 ns         4475 ns        60425 Bytes=14.6433G/s Op=915.207M/s
bench/sigmoid_fp64_1x16_x86_avx512f_avx512bw/m:4/n:1024/real_time       4822 ns         4821 ns        59593 Bytes=13.5913G/s Op=849.459M/s
bench/sigmoid_fp64_1x16_x86_avx512f_avx512bw/m:16/n:256/real_time       4842 ns         4840 ns        56596 Bytes=13.5363G/s Op=846.016M/s
bench/sigmoid_fp32_1x16_x86_avx2/m:1/n:4096/real_time                   3789 ns         3788 ns        69486 Bytes=8.64752G/s Op=1.08094G/s
bench/sigmoid_fp32_1x16_x86_avx2/m:4/n:1024/real_time                   3892 ns         3892 ns        74142 Bytes=8.41825G/s Op=1.05228G/s
bench/sigmoid_fp32_1x16_x86_avx2/m:16/n:256/real_time                   3757 ns         3756 ns        72827 Bytes=8.72073G/s Op=1.09009G/s
bench/sigmoid_fp64_1x8_x86_avx2/m:1/n:4096/real_time                   10451 ns        10450 ns        26516 Bytes=6.27103G/s Op=391.939M/s
bench/sigmoid_fp64_1x8_x86_avx2/m:4/n:1024/real_time                   11010 ns        11007 ns        24451 Bytes=5.95261G/s Op=372.038M/s
bench/sigmoid_fp64_1x8_x86_avx2/m:16/n:256/real_time                   10475 ns        10472 ns        26374 Bytes=6.2567G/s Op=391.044M/s
bench/sigmoid_fp32_1x32_x86_sse2/m:1/n:4096/real_time                   5649 ns         5648 ns        49675 Bytes=5.80048G/s Op=725.06M/s
bench/sigmoid_fp32_1x32_x86_sse2/m:4/n:1024/real_time                   5646 ns         5645 ns        50916 Bytes=5.80353G/s Op=725.441M/s
bench/sigmoid_fp32_1x32_x86_sse2/m:16/n:256/real_time                   5571 ns         5571 ns        48792 Bytes=5.88151G/s Op=735.188M/s
bench/sigmoid_fp64_1x8_x86_sse2/m:1/n:4096/real_time                   15957 ns        15952 ns        17116 Bytes=4.10712G/s Op=256.695M/s
bench/sigmoid_fp64_1x8_x86_sse2/m:4/n:1024/real_time                   15657 ns        15654 ns        17451 Bytes=4.18581G/s Op=261.613M/s
bench/sigmoid_fp64_1x8_x86_sse2/m:16/n:256/real_time                   15748 ns        15744 ns        17804 Bytes=4.16163G/s Op=260.102M/s
```
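The `Op` and `Bytes` counters can be related back to the `Time` column with a little arithmetic. The sketch below assumes each benchmark processes `m * n` elements per iteration and that `Bytes` counts one read plus one write per element; the second assumption is inferred from the ratio `Bytes / Op` (8 for fp32, 16 for fp64), not from the benchmark source. The function name is hypothetical.

```python
def throughput(m, n, time_ns, bytes_per_element):
    """Return (elements per second, bytes per second) for one benchmark row."""
    elements = m * n
    seconds = time_ns * 1e-9
    ops = elements / seconds
    # Assumption: every element is read once and written once.
    return ops, ops * 2 * bytes_per_element

# Rows for m:1/n:4096 on AVX-512, times taken from the table above.
fp32_ops, fp32_bytes = throughput(1, 4096, 1786, 4)
fp64_ops, fp64_bytes = throughput(1, 4096, 4475, 8)

# Speedup of the AVX-512 kernels over the reference implementations.
speedup_fp32 = 38579 / 1786
speedup_fp64 = 91326 / 4475

print(f"fp32: {fp32_ops / 1e9:.3f} Gop/s, {fp32_bytes / 1e9:.2f} GB/s")
print(f"fp64: {fp64_ops / 1e6:.1f} Mop/s, {fp64_bytes / 1e9:.2f} GB/s")
print(f"speedup vs reference: fp32 {speedup_fp32:.1f}x, fp64 {speedup_fp64:.1f}x")
```

The printed times are rounded to whole nanoseconds, so the recomputed counters agree with the table only to about three significant digits.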

PiperOrigin-RevId: 918081537
copybara-service[bot] merged commit cc68da8 into master on May 19, 2026
copybara-service[bot] deleted the test_917603787 branch on May 19, 2026, 23:15